Let’s assume that we have a population with a total of 10 subjects, labelled from 1 to 10. To randomly select 3 of them, we can use the sample function. Each time we run sample, different subjects will be selected. Try this a couple of times.
sample(10, 3)
## [1] 8 1 9
sample(10, 3)
## [1] 3 9 10
Now suppose we would like to select the same random samples every time. We can achieve this by setting a random seed:
set.seed(3728)
sample(10, 3)
## [1] 5 8 7
set.seed(3728)
sample(10, 3)
## [1] 5 8 7
Let’s practice with a fun example: select two of our group members to come early next Monday.
group.member <- c('Meena', 'Tsung-Heng', 'Ting', 'April', 'Dan', 'Cyril', 'Kylie', 'Sara')
sample(group.member, 2)
## [1] "Dan" "Cyril"
We can also create a random order using all elements of the iPRG dataset. Again, we can achieve this using sample, asking for exactly as many samples as there are elements in the set. This time, each repetition gives us a different order of the complete set.
msrun <- unique(iprg$Run)
msrun
## randomize order among all 12 MS runs
sample(msrun, length(msrun))
## different order will be shown.
sample(msrun, length(msrun))
Blocking allows us to remove known sources of variability that we are not interested in:

- Group conditions into blocks such that the conditions in a block are as similar as possible.
- Randomly assign samples within a block.
This particular dataset contains a total of 12 MS runs across 4 conditions, with 3 technical replicates per condition. Using the block.random function in the psych package, we can generate randomised block designs. block.random makes a random assignment of n subjects, with an equal number in each of N conditions.
library("psych") ## load the psych package
msrun <- unique(iprg[, c('Condition','Run')])
msrun
## 4 Conditions of 12 MS runs randomly ordered
block.random(n = 12, c(Condition = 4))
## blocks Condition
## S1 1 2
## S2 1 4
## S3 1 3
## S4 1 1
## S5 2 1
## S6 2 2
## S7 2 4
## S8 2 3
## S9 3 4
## S10 3 3
## S11 3 2
## S12 3 1
block.random(n = 12, c(Condition = 4, BioReplicate=3))
## blocks Condition BioReplicate
## S1 1 1 3
## S2 1 1 2
## S3 1 4 2
## S4 1 2 2
## S5 1 3 2
## S6 1 2 3
## S7 1 1 1
## S8 1 4 1
## S9 1 3 3
## S10 1 3 1
## S11 1 2 1
## S12 1 4 3
Let’s start with one protein as an example and calculate the mean, standard deviation and standard error of the mean across all replicates per condition. We then store all the computed statistics in a single summary data frame for easy access.
We can use the aggregate function to compute summary statistics. aggregate splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
# check what proteins are in dataset, show all protein names
head(unique(iprg$Protein))
# Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]
# there are 12 rows in oneproteindata
oneproteindata
# If you want to see more details,
?aggregate
## splits 'oneproteindata' into subsets by 'Condition',
## then, compute 'FUN=mean' of 'log2Int'
sub.mean <- aggregate(Log2Intensity ~ Condition,
data = oneproteindata,
FUN = mean)
sub.mean
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2} \]
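This quantity can be checked by hand. The sketch below (using made-up intensity values, not the iPRG data) computes the formula directly and compares it with R’s built-in sd:

```r
## manual standard deviation, following the formula above
x <- c(26.30, 26.12, 26.29)   # made-up log2 intensities
n <- length(x)
s.manual <- sqrt(sum((x - mean(x))^2) / (n - 1))
all.equal(s.manual, sd(x))    # TRUE: identical to the built-in sd
```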
Challenge
Using the aggregate function above, calculate the median, by applying the median function.
## The same as mean calculation above. 'FUN' is changed to 'sd'.
sub.median <- aggregate(Log2Intensity ~ Condition,
data = oneproteindata, FUN = median)
sub.median
Using the aggregate function above, calculate the standard deviation, by applying the sd function.
## The same as mean calculation above. 'FUN' is changed to 'sd'.
sub.sd <- aggregate(Log2Intensity ~ Condition,
data = oneproteindata, FUN = sd)
sub.sd
Challenge
Using the aggregate function above, count the number of observations per group with the length function.
## The same as mean calculation. 'FUN' is changed to 'length'.
sub.len <- aggregate(Log2Intensity ~ Condition,
data = oneproteindata,
FUN = length)
sub.len
\[ SE = \sqrt{\frac{s^2}{n}} \]
sub.se <- sqrt(sub.sd$Log2Intensity^2 / sub.len$Log2Intensity)
sub.se
We can now make the summary table including the results above (mean, sd, se and length).
## paste0 : concatenate vectors after converting to character.
(grp <- paste0("Condition", 1:4))
## [1] "Condition1" "Condition2" "Condition3" "Condition4"
## It is equivalent to paste("Condition", 1:4, sep="")
summaryresult <- data.frame(Group = grp,
mean = sub.mean$Log2Intensity,
sd = sub.sd$Log2Intensity,
se = sub.se,
length = sub.len$Log2Intensity)
summaryresult
Error bars can have a variety of meanings if what they represent is not precisely specified. Below we provide some examples of common types of error bars. We’re using the summary of protein sp|P44015|VAC2_YEAST from the previous section and the ggplot2 package, as it provides a convenient way to make easily adaptable plots.
# means without any errorbar
library("ggplot2")
p <- ggplot(aes(x = Group, y = mean, colour = Group),
            data = summaryresult) +
    geom_point(size = 3)
p
Let’s change a number of visual properties to make the plot more attractive.
We will use:

- labs() to set the title and the axis labels, e.g. labs(title="Mean", x="Condition", y='Log2(Intensity)')
- theme_bw() to use a black-and-white theme
- theme() to tune individual elements, e.g. legend.position (set it to 'none' to remove the legend)

p2 <- p + labs(title = "Mean", x = "Group", y = 'Log2(Intensity)') +
theme_bw() +
theme(plot.title = element_text(size = 25, colour = "darkblue"),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 13),
legend.position = 'bottom',
legend.background = element_rect(colour = 'black'),
legend.key = element_rect(colour = 'white'))
p2
Let’s now add the standard deviation:
# mean with SD
p2 + geom_errorbar(aes(ymax = mean + sd, ymin = mean - sd), width = 0.1) +
labs(title="Mean with SD")
Challenge
Add the standard error of the mean. Which one is smaller?
# mean with SE
p2 + geom_errorbar(aes(ymax = mean + se, ymin=mean - se), width = 0.1) +
labs(title="Mean with SE")
## The SE is narrower than the SD!
Challenge
Add the standard error of the mean with black color.
# mean with SE
p2 + geom_errorbar(aes(ymax = mean + se, ymin=mean - se), width = 0.1, color='black') +
labs(title="Mean with SE")
For each statistical distribution, we have functions to compute the density, the cumulative distribution, the quantiles, and random draws from the distribution. For the normal distribution norm, these are respectively dnorm, pnorm, qnorm and rnorm.

Let’s start by sampling 1000000 values from a normal distribution \(N(0, 1)\):
xn <- rnorm(1e6)
hist(xn, freq = FALSE)
rug(xn)
lines(density(xn), lwd = 2)
By definition, the area under the density curve is 1. The area at the left of 0, 1, and 2 are respectively:
pnorm(0)
## [1] 0.5
pnorm(1)
## [1] 0.8413447
pnorm(2)
## [1] 0.9772499
To ask the inverse question, we use the quantile function. To obtain the quantiles corresponding to the probabilities 0.5, 0.8413447 and 0.9772499, we compute:
qnorm(0.5)
## [1] 0
qnorm(pnorm(1))
## [1] 1
qnorm(pnorm(2))
## [1] 2
Finally, the density function gives us the height of the density curve at a given value:
hist(xn, freq = FALSE)
lines(density(xn), lwd = 2)
points(0, dnorm(0), pch = 19, col = "red")
points(1, dnorm(1), pch = 1, col = "blue")
points(2, dnorm(2), pch = 4, col = "orange")
Now that we’ve covered the standard error of the mean and the standard deviation, let’s investigate how we can add custom confidence intervals (CI) for our measurement of the mean. We’ll add these CI’s to the summary results we previously stored for protein sp|P44015|VAC2_YEAST.
Confidence interval:
\[\mbox{mean} \pm (SE \times \frac{\alpha}{2} ~ \mbox{quantile of t distribution})\]
To calculate the 95% confidence interval, we need to be careful and use the two-sided quantile: the error \(\alpha\) is split in two, so for a 95% confidence interval the right tail is 2.5% and the left tail is 2.5%.
summaryresult$ciw.lower.95 <- summaryresult$mean -
    qt(0.975, summaryresult$length - 1) * summaryresult$se
summaryresult$ciw.upper.95 <- summaryresult$mean +
    qt(0.975, summaryresult$length - 1) * summaryresult$se
summaryresult
# mean with 95% two-sided confidence interval
ggplot(aes(x = Group, y = mean, colour = Group),
data = summaryresult) +
geom_point() +
geom_errorbar(aes(ymax = ciw.upper.95, ymin = ciw.lower.95), width = 0.1) +
labs(title="Mean with 95% confidence interval", x="Condition", y='Log2(Intensity)') +
theme_bw() +
theme(plot.title = element_text(size=25, colour="darkblue"),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=13),
legend.position = 'bottom',
legend.background = element_rect(colour = 'black'),
legend.key = element_rect(colour='white'))
Challenges
Replicate the above for the 99% two-sided confidence interval.
# mean with 99% two-sided confidence interval
summaryresult$ciw.lower.99 <- summaryresult$mean - qt(0.995, summaryresult$length - 1) * summaryresult$se
summaryresult$ciw.upper.99 <- summaryresult$mean + qt(0.995, summaryresult$length - 1) * summaryresult$se
summaryresult
ggplot(aes(x = Group, y = mean, colour = Group),
data = summaryresult) +
geom_point() +
geom_errorbar(aes(ymax = ciw.upper.99, ymin=ciw.lower.99), width=0.1) +
labs(title="Mean with 99% confidence interval", x="Condition", y='Log2(Intensity)') +
theme_bw()+
theme(plot.title = element_text(size=25, colour="darkblue"),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=13),
legend.position = 'bottom',
legend.background = element_rect(colour='black'),
legend.key = element_rect(colour='white'))
Error bars with SD and CI are overlapping between groups!
Error bars for the SD show the spread of the population while error bars based on SE reflect the uncertainty in the mean and depend on the sample size.
A confidence interval at level n, on the other hand, means that such intervals capture the population mean n percent of the time.
When the sample size increases, CI and SE are getting closer to each other.
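A quick simulation (with simulated values, not the iPRG measurements) makes this concrete: as n grows, the SE and the 95% CI half-width shrink, while the SD estimate stays stable:

```r
set.seed(123)
for (n in c(3, 30, 300)) {
    x <- rnorm(n, mean = 26, sd = 0.5)   # simulated log2 intensities
    se <- sd(x) / sqrt(n)                # standard error of the mean
    ciw <- qt(0.975, n - 1) * se         # half-width of the 95% CI
    cat("n =", n, " sd =", round(sd(x), 3),
        " se =", round(se, 3), " CI half-width =", round(ciw, 3), "\n")
}
```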
We have two objects that contain all the information that we have generated so far:
- The summaryresult object, that contains all the summary statistics.
- The iprg data frame, that was read from the csv file. This object can easily be regenerated using read.csv, and hence doesn’t necessarily need to be saved explicitly.

save(summaryresult, file = "./data/summaryresults.rda")
save(iprg, file = "./data/iprg.rda")
We can also save the summary result as a csv file using the write.csv function:
write.csv(summaryresult, file = "./data/summary.csv")
Tip: Exporting to csv is useful to share your work with collaborators that do not use R, but for any continued work in R, to assure data validity across platforms, the best format is rda.
First, we are going to prepare the session for further analyses.
load("./data/summaryresults.rda")
load("./data/iprg.rda")
Now, we’ll perform a t-test to assess whether protein sp|P44015|VAC2_YEAST shows a change in abundance between Condition 1 and Condition 2.
## Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]
## Then, get two conditions only, because t.test only works for two groups (conditions).
oneproteindata.condition12 <- oneproteindata[oneproteindata$Condition %in%
c('Condition1', 'Condition2'), ]
oneproteindata.condition12
## Protein Log2Intensity Run
## 21096 sp|P44015|VAC2_YEAST 26.30163 JD_06232014_sample1_B.raw
## 21097 sp|P44015|VAC2_YEAST 26.11643 JD_06232014_sample1_C.raw
## 21098 sp|P44015|VAC2_YEAST 26.29089 JD_06232014_sample1-A.raw
## 21099 sp|P44015|VAC2_YEAST 25.81957 JD_06232014_sample2_A.raw
## 21100 sp|P44015|VAC2_YEAST 26.11527 JD_06232014_sample2_B.raw
## 21101 sp|P44015|VAC2_YEAST 26.08498 JD_06232014_sample2_C.raw
## Condition BioReplicate Intensity TechReplicate
## 21096 Condition1 1 82714388 B
## 21097 Condition1 1 72749239 C
## 21098 Condition1 1 82100518 A
## 21099 Condition2 2 59219741 A
## 21100 Condition2 2 72690802 B
## 21101 Condition2 2 71180513 C
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
## BioReplicate
## Condition 1 2
## Condition1 3 0
## Condition2 0 3
If we want to remove the levels that are not relevant anymore, we can use droplevels:
oneproteindata.condition12 <- droplevels(oneproteindata.condition12)
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
## BioReplicate
## Condition 1 2
## Condition1 3 0
## Condition2 0 3
To perform the t-test, we use the t.test function. Let’s first familiarise ourselves with it by looking at the manual:
?t.test
And now apply it to our data:
# t test for different abundance (log2Int) between Groups (Condition)
result <- t.test(Log2Intensity ~ Condition,
data = oneproteindata.condition12,
var.equal = FALSE)
result
##
## Welch Two Sample t-test
##
## data: Log2Intensity by Condition
## t = 2.0608, df = 3.4001, p-value = 0.1206
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1025408 0.5619598
## sample estimates:
## mean in group Condition1 mean in group Condition2
## 26.23632 26.00661
Challenge
Repeat the t-test above but with calculating a 90% confidence interval for the log2 fold change.
The t.test function, like other hypothesis testing functions, returns a result of a type we haven’t encountered yet, the htest class:
class(result)
## [1] "htest"
which stores typical results from such tests. Let’s have a more detailed look at what information we can learn from the results of our t-test. When we type the name of our result object, we get a short textual summary, but the object contains more details:
names(result)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
and we can access each of these by using the $ operator, like we used to access a single column from a data.frame, but the htest class is not a data.frame (it’s actually a list). For example, to access the group means, we would use
result$estimate
## mean in group Condition1 mean in group Condition2
## 26.23632 26.00661
Challenge
- Calculate the (log2-transformed) fold change between groups
- Extract the value of the t-statistics
- Calculate the standard error (fold-change/t-statistics)
- Extract the degrees of freedom (parameter)
- Extract the p values
- Extract the 95% confidence intervals
We can also manually compute our t-test statistic using the formulas we described above and compare it with the summaryresult.
Recall the summaryresult we generated last section.
summaryresult
## Group mean sd se length ciw.lower.95
## 1 Condition1 26.23632 0.10396539 0.06002444 3 25.97805
## 2 Condition2 26.00661 0.16268179 0.09392438 3 25.60248
## 3 Condition3 23.25609 0.09467798 0.05466236 3 23.02090
## 4 Condition4 20.97056 0.73140174 0.42227499 3 19.15366
## ciw.upper.95 ciw.lower.99 ciw.upper.99
## 1 26.49458 25.64058 26.83205
## 2 26.41073 25.07442 26.93879
## 3 23.49128 22.71357 23.79860
## 4 22.78746 16.77955 25.16157
summaryresult12 <- summaryresult[1:2, ]
## compute the test statistic manually; it is the same as 'result$statistic' above.
diff(summaryresult12$mean) ## different sign, but absolute values are same as result$estimate[1]-result$estimate[2]
## [1] -0.2297095
sqrt(sum(summaryresult12$sd^2/summaryresult12$length)) ## same as the standard error
## [1] 0.1114662
## the t-statistic : sign is different
diff(summaryresult12$mean)/sqrt(sum(summaryresult12$sd^2/summaryresult12$length))
## [1] -2.060799
See the previous Working with statistical distributions for a reminder.
Referring back to our t-test results above, we can manually calculate the one- and two-side tests p-values using the t-statistics and the test parameter (using the pt function).
Our result t statistic was 2.0607988 (accessible with result$statistic). Let’s start by visualising it along a t distribution. Let’s create data from such a distribution, making sure we set the appropriate parameter, the degrees of freedom.
## generate 10^5 number with the same degree of freedom for distribution.
xt <- rt(1e5, result$parameter)
plot(density(xt), xlim = c(-10, 10))
abline(v = result$statistic, col = "red") ## where t statistics are located.
abline(h = 0, col = "gray") ## horizontal line at 0
The area on the left of that point is given by pt(result$statistic, result$parameter), which is 0.939693. The p-value for a one-sided test, which is **the area on the right** of the red line, is thus given by
1 - pt(result$statistic, result$parameter)
## t
## 0.06030697
And the p-value for a two-sided test is
2 * (1 - pt(result$statistic, result$parameter))
## t
## 0.1206139
which is the same as the one calculated by the t-test.
To calculate the required sample size, you’ll need to know four things:

- the significance level \(\alpha\),
- the power \(1 - \beta\),
- the anticipated difference \(\Delta\) (here, a log2 fold change),
- the anticipated variability \(\sigma\).

Assuming equal variance and number of samples across groups, the following formula is used for sample size estimation:
\[\frac{2{\sigma}^2}{n}\leq(\frac{\Delta}{z_{1-\beta}+z_{1-\alpha/2}})^2\]
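Solving the inequality for \(n\) gives \(n \ge 2\sigma^2 (z_{1-\beta}+z_{1-\alpha/2})^2 / \Delta^2\). As a sketch, plugging in the same values as used below (\(\alpha = 0.05\), power \(= 0.95\), \(\Delta = 1\), \(\sigma = 0.9\)) reproduces, up to the normal approximation, the pwr.t.test result:

```r
alpha <- 0.05
power <- 0.95
delta <- 1     # anticipated log2 fold change
sigma <- 0.9   # anticipated standard deviation
n <- 2 * sigma^2 * (qnorm(power) + qnorm(1 - alpha/2))^2 / delta^2
ceiling(n)     # 22 per group with these values
```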
library("pwr")
## ?pwr.t.test
# Significance level alpha
alpha <- 0.05
# Power = 1 - beta
power <- 0.95
# anticipated log2 fold change
delta <- 1
# anticipated variability
sigma <- 0.9
# Effect size
# It quantifies the size of the difference between two groups
d <- delta/sigma
#Sample size estimation
pwr.t.test(d = d, sig.level = alpha, power = power, type = 'two.sample')
##
## Two-sample t test power calculation
##
## n = 22.06036
## d = 1.111111
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
Challenge
- Calculate power with 10 samples and the same parameters as above.
Let’s investigate the effect of required fold change and variance on the sample size estimation.
# anticipated log2 fold change
delta <- seq(0.1, 0.7, .1)
nd <- length(delta)
# anticipated variability
sigma <- seq(0.1,0.5,.1)
ns <- length(sigma)
# obtain sample sizes
samsize <- matrix(0, nrow=ns*nd, ncol = 3)
counter <- 0
for (i in 1:nd){
for (j in 1:ns){
result <- pwr.t.test(d = delta[i]/sigma[j],
sig.level = alpha,
power = power,
type = "two.sample")
counter <- counter + 1
samsize[counter,1] <- delta[i]
samsize[counter,2] <- sigma[j]
samsize[counter,3] <- ceiling(result$n)
}
}
colnames(samsize) <- c("desiredlog2FC","variability","samplesize")
library("ggplot2")
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
samsize <- as.data.frame(samsize)
samsize$variability <- as.factor(samsize$variability)
ggplot(data=samsize, aes(x=desiredlog2FC, y=samplesize, group = variability, colour = variability)) +
geom_line() +
geom_point(size=2, shape=21, fill="white") +
labs(title="Significance level=0.05, Power=0.95", x="Anticipated log2 fold change", y='Sample Size (n)') +
theme(plot.title = element_text(size=20, colour="darkblue"),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=13))
The decision of which statistical model is appropriate for a given set of observations depends on the type of data that have been collected.
Quantitative response with quantitative predictors : regression model
Categorical response with quantitative predictors : logistic regression model for bivariate categorical response (e.g., Yes/No, dead/alive), multivariate logistic regression model when the response variable has more than two possible values.
Quantitative response with categorical predictors : ANOVA model (quantitative response across several populations defined by one or more categorical predictor variables)
Categorical response with categorical predictors : contingency table that can be used to draw conclusions about the relationships between variables.
Part 5b is adapted from Bremer & Doerge, Using R at the Bench: Step-by-Step Data Analytics for Biologists, Cold Spring Harbor Laboratory Press, 2015.
For this part, we are going to use a new dataset, which contains the patient information from TCGA colorectal cohort. This data is from Proteogenomic characterization of human colon and rectal cancer (Zhang et al. 2014).
Let’s read in and explore the TCGA colorectal cancer data:
TCGA.CRC <- read.csv("./data/TCGA_sample_information.csv")
head(TCGA.CRC)
## TCGA.participant.ID Gender Cancer BRAF.mutation history_of_colon_polyps
## 1 TCGA-A6-3807 Female Colon 0 NO
## 2 TCGA-A6-3808 Male Colon 0 YES
## 3 TCGA-A6-3810 Male Colon 0 YES
## 4 TCGA-AA-3518 Female Colon 0 NO
## 5 TCGA-AA-3525 Male Colon 1 NO
## 6 TCGA-AA-3526 Male Colon 0 YES
Rows in the data frame are patients and columns are patient information. The columns are defined as follows:
| Variable |
|---|
| TCGA participant ID |
| Gender |
| Cancer type |
| BRAF mutation status |
| History of colon polyps |
Let’s first familiarise ourselves with the data.
## check the dimension of data
dim(TCGA.CRC)
## [1] 78 5
dim(TCGA.CRC$Gender) # dim function is for matrix, array or data frame.
## NULL
## check unique information of Gender column.
unique(TCGA.CRC$Gender)
## [1] Female Male
## Levels: Female Male
class(TCGA.CRC$Gender)
## [1] "factor"
Challenge
- Get unique information and class for
Cancerinformation- Get unique information and class for
BRAF.mutationinformation- Get unique information and class for
history_of_colon_polypsinformation
The table function generates a contingency table (or classification table), i.e. the frequency of outcomes at each combination of factor levels.
## check how many female and male are in the dataset
table(TCGA.CRC$Gender)
##
## Female Male
## 34 44
## check unique information of participant ID
unique(TCGA.CRC$TCGA.participant.ID)
## [1] TCGA-A6-3807 TCGA-A6-3808 TCGA-A6-3810 TCGA-AA-3518 TCGA-AA-3525
## [6] TCGA-AA-3526 TCGA-AA-3529 TCGA-AA-3531 TCGA-AA-3534 TCGA-AA-3552
## [11] TCGA-AA-3554 TCGA-AA-3558 TCGA-AA-3561 TCGA-AA-3664 TCGA-AA-3666
## [16] TCGA-AA-3672 TCGA-AA-3684 TCGA-AA-3695 TCGA-AA-3710 TCGA-AA-3715
## [21] TCGA-AA-3818 TCGA-AA-3848 TCGA-AA-3986 TCGA-AA-3989 TCGA-AA-A004
## [26] TCGA-AA-A00A TCGA-AA-A00F TCGA-AA-A00K TCGA-AA-A00N TCGA-AA-A00U
## [31] TCGA-AA-A010 TCGA-AA-A01D TCGA-AA-A01F TCGA-AA-A01I TCGA-AA-A01K
## [36] TCGA-AA-A01P TCGA-AA-A01R TCGA-AA-A01S TCGA-AA-A01T TCGA-AA-A01V
## [41] TCGA-AA-A01X TCGA-AA-A01Z TCGA-AA-A022 TCGA-AA-A024 TCGA-AA-A029
## [46] TCGA-AA-A02H TCGA-AA-A02O TCGA-AA-A02Y TCGA-AA-A03F TCGA-AA-A03J
## [51] TCGA-AF-2691 TCGA-AF-2692 TCGA-AF-3400 TCGA-AF-3913 TCGA-AG-3574
## [56] TCGA-AG-3580 TCGA-AG-3584 TCGA-AG-3593 TCGA-AG-A002 TCGA-AG-A008
## [61] TCGA-AG-A00C TCGA-AG-A00H TCGA-AG-A00Y TCGA-AG-A011 TCGA-AG-A014
## [66] TCGA-AG-A015 TCGA-AG-A01L TCGA-AG-A01W TCGA-AG-A01Y TCGA-AG-A020
## [71] TCGA-AG-A026 TCGA-AG-A02N TCGA-AG-A02X TCGA-AG-A032
## 74 Levels: TCGA-A6-3807 TCGA-A6-3808 TCGA-A6-3810 ... TCGA-AG-A032
length(unique(TCGA.CRC$TCGA.participant.ID))
## [1] 74
There are 74 unique participant IDs, but a total of 78 rows, so we can suspect that some IDs appear in multiple rows. Let’s check by getting the count per participant ID:
countID <- table(TCGA.CRC$TCGA.participant.ID)
countID
##
## TCGA-A6-3807 TCGA-A6-3808 TCGA-A6-3810 TCGA-AA-3518 TCGA-AA-3525
## 1 1 1 1 1
## TCGA-AA-3526 TCGA-AA-3529 TCGA-AA-3531 TCGA-AA-3534 TCGA-AA-3552
## 1 1 1 1 1
## TCGA-AA-3554 TCGA-AA-3558 TCGA-AA-3561 TCGA-AA-3664 TCGA-AA-3666
## 1 1 1 1 1
## TCGA-AA-3672 TCGA-AA-3684 TCGA-AA-3695 TCGA-AA-3710 TCGA-AA-3715
## 1 1 1 1 1
## TCGA-AA-3818 TCGA-AA-3848 TCGA-AA-3986 TCGA-AA-3989 TCGA-AA-A004
## 1 1 1 1 1
## TCGA-AA-A00A TCGA-AA-A00F TCGA-AA-A00K TCGA-AA-A00N TCGA-AA-A00U
## 2 1 2 2 1
## TCGA-AA-A010 TCGA-AA-A01D TCGA-AA-A01F TCGA-AA-A01I TCGA-AA-A01K
## 1 1 1 1 1
## TCGA-AA-A01P TCGA-AA-A01R TCGA-AA-A01S TCGA-AA-A01T TCGA-AA-A01V
## 1 1 1 1 1
## TCGA-AA-A01X TCGA-AA-A01Z TCGA-AA-A022 TCGA-AA-A024 TCGA-AA-A029
## 1 1 1 1 1
## TCGA-AA-A02H TCGA-AA-A02O TCGA-AA-A02Y TCGA-AA-A03F TCGA-AA-A03J
## 1 1 1 1 1
## TCGA-AF-2691 TCGA-AF-2692 TCGA-AF-3400 TCGA-AF-3913 TCGA-AG-3574
## 1 1 1 1 1
## TCGA-AG-3580 TCGA-AG-3584 TCGA-AG-3593 TCGA-AG-A002 TCGA-AG-A008
## 1 1 1 1 1
## TCGA-AG-A00C TCGA-AG-A00H TCGA-AG-A00Y TCGA-AG-A011 TCGA-AG-A014
## 1 2 1 1 1
## TCGA-AG-A015 TCGA-AG-A01L TCGA-AG-A01W TCGA-AG-A01Y TCGA-AG-A020
## 1 1 1 1 1
## TCGA-AG-A026 TCGA-AG-A02N TCGA-AG-A02X TCGA-AG-A032
## 1 1 1 1
Some participant IDs have 2 rows. Let’s get the subset that has more than one row.
unique(countID)
## [1] 1 2
countID[countID > 1]
##
## TCGA-AA-A00A TCGA-AA-A00K TCGA-AA-A00N TCGA-AG-A00H
## 2 2 2 2
Four participants have two rows each. Let’s check the rows for participant ID TCGA-AA-A00A:
TCGA.CRC[TCGA.CRC$TCGA.participant.ID == 'TCGA-AA-A00A', ]
## TCGA.participant.ID Gender Cancer BRAF.mutation history_of_colon_polyps
## 26 TCGA-AA-A00A Male Colon 0 NO
## 27 TCGA-AA-A00A Male Colon 0 NO
There are two rows with the same information. They are duplicated rows. Let’s remove them.
TCGA.CRC <- TCGA.CRC[!duplicated(TCGA.CRC), ]
Challenge
- Check whether dimension and number of participants ID are changed after removing duplicated rows.
The table function can also compute a 2-way contingency table, i.e. the frequency of outcomes for each combination of two categorical variables. Let’s get the table and visualise the count data.
cancer.polyps <- table(TCGA.CRC$history_of_colon_polyps, TCGA.CRC$Cancer)
cancer.polyps
##
## Colon Rectum
## NO 29 19
## YES 21 5
dotchart(cancer.polyps, xlab = "Observed counts")
Hypothesis in general :
\(H_0\) : each population has the same proportion of observations, \(\pi_{j=1|i=1} = \pi_{j=1|i=2}\)
\(H_a\) : different populations have different proportions of observations.
Hypothesis that we are interested in for this example: whether the proportion of patients with a history of colon polyps among the patients with colon cancer is different from that among the patients with rectal cancer.
polyps <- cancer.polyps[2, ]
cancer <- apply(cancer.polyps, 2, sum)
pt <- prop.test(polyps, cancer)
pt
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: polyps out of cancer
## X-squared = 2.3268, df = 1, p-value = 0.1272
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.03156841 0.45490175
## sample estimates:
## prop 1 prop 2
## 0.4200000 0.2083333
# name of output
names(pt)
## [1] "statistic" "parameter" "p.value" "estimate" "null.value"
## [6] "conf.int" "alternative" "method" "data.name"
# proportion in each group
pt$estimate
## prop 1 prop 2
## 0.4200000 0.2083333
# test statistic value
pt$statistic
## X-squared
## 2.32678
# degree of freedom
pt$parameter
## df
## 1
Hypothesis in general :
\(H_0\) : the factors are independent.
\(H_a\) : the factors are not independent.
Hypothesis that we are interested in for this example: whether history of colon polyps and type of cancer are independent or not.
\[\chi^2 =\sum_{i=1}^2 \sum_{j=1}^2 \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi^2_{(2-1)(2-1)}\]
\(O_{ij}\) : \(n_{ij}\), which is the count within the cells
\(E_{ij}\) : \(n_{i+}n_{+j}/n\), where \(n_{i+}\) is the row count sum, \(n_{+j}\) is the column count sum and n is the total count.
!! Assumption: the distribution of the test statistic is approximately chi-squared if the expected counts are large enough. Use this test only if the expected count for each cell is greater than 5.
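Following the formula above, the statistic can also be computed by hand. The sketch below rebuilds the table from the counts shown earlier (29, 19, 21, 5); note that it omits the Yates continuity correction that chisq.test applies by default, so the value differs slightly from the corrected one:

```r
## observed counts, as in the cancer.polyps table above
O <- matrix(c(29, 21, 19, 5), nrow = 2,
            dimnames = list(polyps = c("NO", "YES"),
                            cancer = c("Colon", "Rectum")))
## expected counts E_ij = n_i+ * n_+j / n
E <- outer(rowSums(O), colSums(O)) / sum(O)
X2 <- sum((O - E)^2 / E)                 # uncorrected chi-squared statistic
pchisq(X2, df = 1, lower.tail = FALSE)   # corresponding p-value
```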
chisq.test(cancer.polyps)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cancer.polyps
## X-squared = 2.3268, df = 1, p-value = 0.1272
Mathematically, the two tests above are equivalent. prop.test() uses chisq.test() internally and prints its output differently.
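We can check this equivalence directly (the table is rebuilt here so the snippet is self-contained):

```r
cancer.polyps <- matrix(c(29, 21, 19, 5), nrow = 2,
                        dimnames = list(c("NO", "YES"), c("Colon", "Rectum")))
pt <- prop.test(cancer.polyps[2, ], colSums(cancer.polyps))
ct <- chisq.test(cancer.polyps)
all.equal(unname(pt$statistic), unname(ct$statistic))  # TRUE
all.equal(pt$p.value, ct$p.value)                      # TRUE
```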
Fisher’s exact test is a non-parametric test of independence. It compares the distributions of counts within the 4 cells. The p-value for Fisher’s exact test is the probability of obtaining, by chance, data more extreme than those observed, if the row and column totals are held fixed.
There are no assumptions on distributions or sample sizes for this test. Therefore, the Fisher’s exact test can be used with small sample sizes. However, if the sample sizes are very small, then the power of the test will be very low.
Apply the Fisher’s exact test using the fisher.test function:
ft <- fisher.test(cancer.polyps)
ft
##
## Fisher's Exact Test for Count Data
##
## data: cancer.polyps
## p-value = 0.1177
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.09234132 1.24084405
## sample estimates:
## odds ratio
## 0.3681978
and extract the odds ratio.
ft$estimate
## odds ratio
## 0.3681978
Challenge
- Test whether the proportion of male patients among the patients with colon cancer is different from that among the patients with rectal cancer.
We could also apply a z-test for comparison of two proportions, defined as
\[Z=\frac{\widehat{p}_1-\widehat{p}_2}{\sqrt{\widehat{p} (1- \widehat{p}) (\frac{1}{n_1} + \frac{1}{n_2})}}\]
where \(\widehat{p}_1 = \frac{x_1}{n_1}\), \(\widehat{p}_2 = \frac{x_2}{n_2}\) and \(\widehat{p}=\frac{x_1 + x_2}{n_1 + n_2}\).
We are going to use this test to illustrate how to write functions in R.
An R function is created with the function constructor, named function, and is composed of:

- a name, here z.prop.p;
- inputs: the number of successes in each group, x1 and x2, and the total number of observations in each category, n1 and n2;
- a body, that computes and here returns the p-value of the test, pvalue.

z.prop.p <- function(x1, x2, n1, n2) {
pi_1 <- x1/n1
pi_2 <- x2/n2
numerator <- pi_1 - pi_2
p_common <- (x1+x2)/(n1+n2)
denominator <- sqrt(p_common * (1-p_common) * (1/n1 + 1/n2))
stat <- numerator/denominator
pvalue <- 2 * (1 - pnorm(abs(stat)))
return(pvalue)
}
z.prop.p(cancer.polyps[2,1], cancer.polyps[2,2], sum(cancer.polyps[,1]), sum(cancer.polyps[,2]))
## [1] 0.07418571
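As a sanity check, the squared z statistic equals the Pearson chi-square statistic for a 2×2 table, so z.prop.p should agree with prop.test() when the continuity correction is switched off. A sketch with made-up counts (the function is repeated so the chunk is self-contained):

```r
## two-sided z-test for two proportions, as defined above
z.prop.p <- function(x1, x2, n1, n2) {
  pi_1 <- x1/n1
  pi_2 <- x2/n2
  p_common <- (x1 + x2)/(n1 + n2)
  stat <- (pi_1 - pi_2)/sqrt(p_common * (1 - p_common) * (1/n1 + 1/n2))
  2 * (1 - pnorm(abs(stat)))
}
p_z <- z.prop.p(30, 40, 100, 100)
p_chisq <- prop.test(c(30, 40), c(100, 100), correct = FALSE)$p.value
all.equal(p_z, p_chisq)  ## TRUE
```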
Challenge
Write a function named f2c (c2f) that converts a temperature from Fahrenheit to Celsius (Celsius to Fahrenheit) using the following formula: \(F = C \times 1.8 + 32\) (\(C = \frac{F - 32}{1.8}\)).
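One possible solution, translating the formulas above directly (try it yourself before reading on):

```r
f2c <- function(f) (f - 32) / 1.8   ## Fahrenheit to Celsius
c2f <- function(c) c * 1.8 + 32     ## Celsius to Fahrenheit
f2c(212)  ## 100
c2f(0)    ## 32
```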
When considering correlations and modelling data, visualisation is key.
Let’s use the famous Anscombe’s quartet data as a motivating example. This data is composed of 4 pairs of values, \((x_1, y_1)\) to \((x_4, y_4)\):
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
Each of these \(x\) and \(y\) sets has the same variance, mean and correlation:
|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| var(x) | 11.0000000 | 11.0000000 | 11.0000000 | 11.0000000 |
| mean(x) | 9.0000000 | 9.0000000 | 9.0000000 | 9.0000000 |
| var(y) | 4.1272691 | 4.1276291 | 4.1226200 | 4.1232491 |
| mean(y) | 7.5009091 | 7.5009091 | 7.5000000 | 7.5009091 |
| cor(x,y) | 0.8164205 | 0.8162365 | 0.8162867 | 0.8165214 |
But…
While the residuals of a linear regression clearly indicate fundamental differences in these data, the simplest and most straightforward approach to highlighting those differences is visualisation.
See also another, more recent example: The Datasaurus Dozen dataset.
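For instance, plotting the four pairs side by side with their fitted regression lines immediately reveals the differences (a minimal sketch using R's built-in anscombe data):

```r
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i))
  abline(lm(y ~ x), col = "red")  ## near-identical fitted line in all four panels
}
```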
Here is an example where a wide format comes in very handy. We are going to convert our iPRG data using the spread function from the tidyr package:
library("tidyr")
iprg2 <- spread(iprg[, 1:3], Run, Log2Intensity)
rownames(iprg2) <- iprg2$Protein
iprg2 <- iprg2[, -1]
And let's focus on 3 of the runs, i.e. 2 replicates from condition 1 and 1 replicate from condition 4:
x <- iprg2[, c(1, 2, 10)]
head(x)
## JD_06232014_sample1-A.raw JD_06232014_sample1_B.raw
## sp|D6VTK4|STE2_YEAST 26.58301 26.81232
## sp|O13297|CET1_YEAST 24.71809 24.71912
## sp|O13329|FOB1_YEAST 23.47075 23.37678
## sp|O13539|THP2_YEAST 24.29661 27.52021
## sp|O13547|CCW14_YEAST 27.11638 27.22234
## sp|O13563|RPN13_YEAST 26.17056 26.09476
## JD_06232014_sample4-A.raw
## sp|D6VTK4|STE2_YEAST 26.65573
## sp|O13297|CET1_YEAST 24.50814
## sp|O13329|FOB1_YEAST 23.03473
## sp|O13539|THP2_YEAST 25.07576
## sp|O13547|CCW14_YEAST 27.07526
## sp|O13563|RPN13_YEAST 25.77958
pairs(x)
We can use the cor function to calculate the Pearson correlation between two vectors of the same length (making sure the order is correct), or between the columns of a data frame. But we have missing values in the data, which will stop us from calculating the correlation:
cor(x)
## JD_06232014_sample1-A.raw
## JD_06232014_sample1-A.raw 1
## JD_06232014_sample1_B.raw NA
## JD_06232014_sample4-A.raw NA
## JD_06232014_sample1_B.raw
## JD_06232014_sample1-A.raw NA
## JD_06232014_sample1_B.raw 1
## JD_06232014_sample4-A.raw NA
## JD_06232014_sample4-A.raw
## JD_06232014_sample1-A.raw NA
## JD_06232014_sample1_B.raw NA
## JD_06232014_sample4-A.raw 1
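Alternatively, cor's use argument can handle missing values directly; for example use = "pairwise.complete.obs" computes each correlation over the observations that are complete for that pair of columns. A toy sketch (made-up values, not the iprg data):

```r
## small matrix with missing values, for illustration only
m <- cbind(a = c(1, 2, 3, 4, NA),
           b = c(2.1, 3.9, 6.2, 8.0, 10.1),
           c = c(1, NA, 3, 4, 5))
cor(m)                                 ## NAs propagate by default
cor(m, use = "pairwise.complete.obs")  ## complete pairs per correlation
```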
We first need to omit the proteins/rows that contain missing values:
x2 <- na.omit(x)
cor(x2)
## JD_06232014_sample1-A.raw
## JD_06232014_sample1-A.raw 1.0000000
## JD_06232014_sample1_B.raw 0.9794954
## JD_06232014_sample4-A.raw 0.9502142
## JD_06232014_sample1_B.raw
## JD_06232014_sample1-A.raw 0.9794954
## JD_06232014_sample1_B.raw 1.0000000
## JD_06232014_sample4-A.raw 0.9502517
## JD_06232014_sample4-A.raw
## JD_06232014_sample1-A.raw 0.9502142
## JD_06232014_sample1_B.raw 0.9502517
## JD_06232014_sample4-A.raw 1.0000000
It is often assumed that high correlation is a hallmark of good replication. Rather than focusing on the correlation of the data, a better approach is to look at the log2 fold-changes, i.e. the distances between our repeated measurements. The ideal way to visualise this is an MA-plot:
par(mfrow = c(1, 2))
r1 <- x2[, 1]
r2 <- x2[, 2]
M <- r1 - r2
A <- (r1 + r2)/2
plot(A, M); grid()
suppressPackageStartupMessages(library("affy"))
affy::ma.plot(A, M)
See also this post on the Simply Statistics blog.
abline(0, 1) can be used to add a line with intercept 0 and slope 1. If we want to add the line that models the data linearly, we can calculate the parameters using the lm function:
lmod <- lm(r2 ~ r1)
summary(lmod)
##
## Call:
## lm(formula = r2 ~ r1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4939 -0.0721 0.0126 0.0881 3.4595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.348190 0.091842 3.791 0.000153 ***
## r1 0.985878 0.003688 267.357 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3263 on 3024 degrees of freedom
## Multiple R-squared: 0.9594, Adjusted R-squared: 0.9594
## F-statistic: 7.148e+04 on 1 and 3024 DF, p-value: < 2.2e-16
which can be used to add the line that reflects the (linear) relationship between the two sets of data:
plot(r1, r2)
abline(lmod, col = "red")
As we have seen at the beginning of this section, it is essential not to rely solely on the correlation value, but to look at the data. This also holds true for linear (or any) modelling, which can be checked by plotting the model:
par(mfrow = c(2, 2))
plot(lmod)
Cook’s distance is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis and can be used to highlight points that particularly influence the regression.
Leverage quantifies the influence of a given observation on the regression due to its location in the space of the inputs.
See also ?influence.measures.
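Cook's distances can also be extracted directly from a fitted model with cooks.distance(). A sketch on simulated data (made up for illustration):

```r
set.seed(1)
xx <- 1:20
yy <- 2 * xx + rnorm(20)
yy[20] <- 100                 ## introduce an influential outlier
fit_cd <- lm(yy ~ xx)
cd <- cooks.distance(fit_cd)
which.max(cd)                 ## the outlier dominates the Cook's distances
```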
Challenge
- Take any of the iprg2 replicates, model and plot their linear relationship. The iprg2 data is available as an rda file, or regenerate it as shown above.
- The Anscombe quartet is available as anscombe. Load it, create a linear model for one \((x_i, y_i)\) pair of your choice and visualise/check the model.
x3 <- anscombe[, 3]
y3 <- anscombe[, 7]
lmod <- lm(y3 ~ x3)
summary(lmod)
##
## Call:
## lm(formula = y3 ~ x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
par(mfrow = c(2, 2))
plot(lmod)
Finally, let’s conclude by illustrating how ggplot2 can very elegantly be used to produce similar plots, with useful annotations:
library("ggplot2")
dfr <- data.frame(r1, r2, M, A)
p <- ggplot(aes(x = r1, y = r2), data = dfr) + geom_point()
p + geom_smooth(method = "lm") +
geom_quantile(colour = "red")
Challenge
Replicate the MA plot above using ggplot2. Then add a non-parametric lowess regression using geom_smooth().
p <- ggplot(aes(x = A, y = M), data = dfr) + geom_point()
p + geom_smooth() + geom_quantile(colour = "red")
## `geom_smooth()` using method = 'gam'
## Smoothing formula not specified. Using: y ~ x